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ABSTRACT 

Combinatorial interactions among transcription 
factors (TFs) are critical for integrating diverse intrin- 
sic and extrinsic signals, fine-tuning regulatory 
output and increasing the robustness and plasticity 
of regulatory systems. Current knowledge about 
combinatorial regulation is rather limited due to the 
lack of suitable experimental technologies and bio- 
informatics tools. The rapid accumulation of ChlP- 
Seq data has provided genome-wide occupancy 
maps for a large number of TFs and chromatin modi- 
fication marks for identifying enhancers without 
knowing individual TP binding sites. Integration of 
the two data types has not been researched exten- 
sively, resulting in underused data and missed 
opportunities. We describe a novel method for dis- 
covering frequent combinatorial occupancy patterns 
by multiple TFs at enhancers. Our method is based 
on probabilistic item set mining and takes into 
account uncertainty in both types of ChlP-Seq data. 
By joint analysis of 108 TFs in four human cell types, 
we found that cell-type-specific interactions among 
TFs are abundant and that the majority of enhancers 
have flexible architecture. We show that several 
families of transposable elements disproportionally 
overlap with enhancers with combinatorial patterns, 
suggesting that these transposable element families 
play an important role in the evolution of combina- 
torial regulation. 

INTRODUCTION 

In higher eukaryotes, transcription factors (TFs) rarely 
operate by themselves, but rather directly or indirectly 
interact with specific partner TFs or chromatin regulators 



when binding to enhancers. It has been estimated that 
roughly 75% of all metazoan TFs heterodimerize with 
other factors (I). Classical examples of combinatorial 
TF regulation include the paradigmatic 'even-skipped 
stripe 2' enhancer for body patterning in fly (2) (involving 
7 TFs and 34 sites in a 1.7-kb region), the endol6 gene 
enhancer for endoderm specification in sea urchin (3) 
(involving 19 TFs and 56 sites in a 2.2-kb region), the 
MyfS gene enhancer for muscle development in mouse 
(4) and the Ifng gene enhancer for the production of inter- 
feron gamma in human and mouse (5) (involving 6 TFs 
and 4 sites in a 55-bp region). Our current knowledge 
about combinatorial regulation, including rules governing 
the molecular architecture, evolutionary, spatial and 
temporal dynamics of enhancers, has relied heavily on 
studies of these classical enhancers (6,7). 

The rapid accumulation of ChlP-Seq data has provided 
genome-wide occupancy maps for a large number of TFs. 
By clustering these TF occupancy maps, several recent 
studies have uncovered hundreds of genomic loci that 
are co-occupied by multiple TFs in various species and 
cell types (8-12), suggesting the abundance of combina- 
torial regulation. However, given the number of TFs 
in mammalian genomes [~2,000, (13)], our current 
knowledge represents only the tip of the iceberg. On the 
technical side, clustering-based approaches to finding 
combinatorial TF occupancy patterns have a few short- 
comings. First, most analyses use binary presentation of 
binding peaks, which makes them vulnerable to noise in 
ChlP-Seq data. Second, because most combinatorial regu- 
latory events occur at enhancers, focusing enhancers will 
enhance the signal-to-noise ratio. So far most clustering 
studies have not incorporated such a constraint. 

The discovery of unique chromatin signatures 
associated with enhancers greatly facilitates enhancer 
mapping without knowledge about the locations of indi- 
vidual TFs (14-16). In addition, such an approach is 
well suited for finding cell- and developmental-specific 
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enhancers and providing information about eniiancer 
action in the native genomic context. Given the increasing 
abundance of TF and chromatin modification ChlP-Seq 
data, a better approach to finding combinatorial patterns 
would be analyzing multiple TF ChlP-Seq data sets using 
enhancers defined by chromatin signatures as the genomic 
location constraint. An advantage of such an approach is 
the reduction of spurious clusters of TF peaks at non- 
enhancer sites and presumably non-functional. 

We propose a novel probabiHstic algorithm to discover 
frequent combinatorial occupancy patterns (FCOPs) 
involving multiple TFs at enhancers, taking into account 
noise in both types of ChlP-Seq data. Our method differs 
from previous DNA-motif-centered approaches by 
directly using ChlP-Seq data and thus avoiding complica- 
tions associated with DNA motif analysis (e.g. motif 
quaUty, the need for binding site cutoff). To our best 
knowledge, this is the first principled approach to 
integrating TF occupancy and chromatin modification 
ChlP-Seq data to study combinatorial TF interactions. 

By applying our algorithm to a set of 108 TFs in four 
human cell types, we identify a number of combinatorial 
TF occupancy patterns that occur frequently in the 
genome. Additional analyses of identified FCOPs reveal 
that cell-type-specific TF interactions are abundant and 
that the majority of enhancers have flexible architecture. 
In addition, we show that several famihes of transposable 
elements (TEs) play an important role in the evolution 
of complex enhancers occupied by multiple TFs. 

MATERIALS AND METHODS 

Discover FCOP of multiple TFs by probabilistic frequent 
itemset mining of uncertain data 

Our method borrows idea from frequent itemset mining 
(FIM) (17). In FIM, customers' transaction data were col- 
lected. Each transaction contains a hst of products that 
are called items. FIM discovers customer buying habits by 
finding associations between different items that cus- 
tomers place in their 'shopping baskets'. An itemset is 
frequent when it occurs in a minimal number of transac- 
tions. Here we equate enhancers to transactions and the 
set of TF binding sites in an enhancer to the itemset. By 
this analogy, the problem of identifying FCOPs becomes 
the problem of identifying frequent itemsets. To deal with 
noise in ChlP-Seq data, we use occurrence probabilities 
of Transcription Factor Binding Site (TFBSs) and enhan- 
cers in our framework. Traditional FIM does not take 
into account uncertainty associated with transactions 
and items. Doing so gives rise to an uncertain transaction 
database, in which the support of an itemset is uncertain 
and is defined by a discrete probabihty distribution 
function. We introduce a novel framework to mine such 
a probabihstic transaction database. 

Problem definition 

Given are a set of 'N' enhancers predicted by CSl-ANN 
(18), E — {e\,e2, ■ ■ ■ , cn], each of which has a probabihty 
of being a true enhancer, pe^, and genome-wide binding 
peaks of a set of 'M' TFs identified by ChlP-Seq, 



T — {t\,t2, - ■ ■ , Im], respectively. For a TF binding peak 
in enhancer e,-, we use p,.^,,. to denote the binding prob- 
ability of the TF to the enhancer. For a given set of TFs, 
Te-, c r, whose peak centers fall within the enhancer, e,-, 
we consider the itemset Tp. as being supported by the 
enhancer <?,. Given a minimum support threshold, 
minSup, the goal of the algorithm is to exhaustively 
search for non-redundant frequent combinatorial TF oc- 
cupancy patterns X = {X\ , X2, ■ ■ ■ , Xi} (Z,- c T and V Xj, 
Xj U Xi ^ Xj, where i ^j, i,j = 1, 2 ... L), whose frequent- 
ness probabilities satisfy a given threshold 'a'. 

Probability calculation for enhancers and TF binding peaks 

In traditional FIM, it is assumed that both transactions 
and items of a transaction occur with certainty. However, 
in our case, the information captured in transactions and 
items has some degree of uncertainty because the existence 
of an enhancer and/or a TF binding peak is inferred from 
ChlP-Seq data. To deal with such uncertainties, we assign a 
probabihty to each enhancer predicted by CSI-ANN and 
each TF binding peak. Because the CSl-ANN algorithm 
outputs probabihties of predicted enhancers, it is straight- 
forward to use these probabihties directly. For TF ChlP- 
Seq data, given its better performance compared with other 
peak callers, we first use MACS with its default parameter 
setting to call binding peaks (19). To compute the prob- 
abihty of a TF binding peak, we use min-max normaliza- 
tion to transform the MACS fold enrichment score into a 
range of [0.5, 1]. We choose this range because peaks called 
using default MACS parameter are relatively strong peaks. 
We also tested the performance of our algorithm with the 
normalization ranges of [0.3, 1] and [0.7, 1]. Our result 
shows that the performance is not sensitive to the choice 
of normalization range (Supplementary Figure S3). To 
avoid peaks with extremely high fold enrichment that 
could skew the normalized probabihty scores, we 
truncate the fold enrichment score at 95 percentfle and 
set the probability score of such peaks to 1 . 

Algorithm to compute the frequentness probability of a 
combinatorial pattern 

Denote X as a combinatorial pattern observed in a set of 
enhancers. Pk{X) is the probability of the pattern having 
support k, k e {0,. . . , N} and P>k(X) is the probabihty of 
the pattern having a support at least k. Xis considered as a 
frequent pattern when P>,mnSup(^ > «■ We consider two 
types of uncertainties: those associated with enhancer pre- 
dictions and those associated with TF binding peak calls. 
If we only consider the uncertainties of TF binding peaks 
in an enhancer (i.e. enhancer is certain), the probability of 
a combinatorial pattern X supported by an enhancer e,- 
can be calculated as follows: 

Pg.{X) = YitexPti,^!- Considering both types of 
uncertainties, the overall probability of pattern X having 
exactly k support can be calculated as 

Pk(X)^ J2 (n(^-.W*^0* n (^-Pe.(^*Pe)] 

SQE,\S\=k\e,eS e,eE~S / 

(1) 

where p^, is the probability of being a true enhancer. 
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Recall that we are interested in the probability that an 
item set occurs in at least minSup transactions. In other 
words, we need the probability that the support of X is at 
least k. Denote this probability as P>k{^, which can be 
computed by the following equation. 

SQE,\S\>k\eieS aeE-S / 

(2) 

With a large number of enhancers and TF binding 
peaks, computation of P>i^(X) by brute-force enumeration 
is inefficient. The time complexity is exponential with 
respect to the number of transactions. Here we adopt a 
dynamic programming-based algorithm first proposed 
by Bernecker et al. (20) to compute the frequentness prob- 
ability recursively. Briefly, define Pkj(X) as the probability 
that k of / enhancers contains pattern X and P>kj{X) as 
the probability that at least k of I enhancers contains the 
pattern. Then, 

P>k,l = Pe,i^*Pei * P>k-l,l-li^ 

+(l-Pe.(^*P,,)*P>k,l-l(r, 

where 

P^Qj = 1 V 0 < / < A^, and P^,,j = 0 V A- > / (4) 

Starting from Equation (4), we can recursively calcu- 
late P>mmS,ip(^ using Equation (3) till k = minSup, 
I = N. By definition, the initial values of the dynamic 
programming matrix are -P>o,o — 1 and -P>i,o = 0. 
Using this dynamic programming scheme, the 
computation of the frequentness probabihty requires 
at most 0[N*minSup) = 0{N) time and at most 0(AO 
space. 

Automatic determination of pattern-specific minimal 
support threshold 

Because different TPs have different numbers of binding 
sites in the genome, it is expected that patterns involving 
TPs with more binding sites have more support than 
patterns involving TPs with fewer binding sites. 
Therefore, a universal minimal support is not appropriate 
for computing frequentness probability. Instead, we deter- 
mine pattern-specific minimal support thresholds that 
have the same frequentness probability in randomized 
input data. Such a probability is analogous to a _P-value. 
In this sense, all pattern-specific minimal support 
thresholds are 'normalized' to have a P<0.01. We 
generate a set of permutated background transactions 
based on the binding frequencies of the specific TPs 
involved in a pattern. We do so by randomly 
redistributing the peaks of a TP across the set of enhan- 
cers. In this way, the number of binding peaks for each TP 
is unchanged in the permutated data, but the correlation 
among TPs is destroyed. Por a given pattern and a range 
of minimal support thresholds, we then compute a set of 
frequentness probabilities in the permutated data as the 
_P-values. We pick the final pattern-specific minimal 
support threshold as the one that gives a P-value of 



0.01. As an example, minimum support thresholds for 
pairwise patterns calculated this way are shown in 
Supplementary Pigure S4. We also calculated minimum 
support thresholds for higher-order patterns. 

See Supplementary Methods for the rest of method 
descriptions. 

RESULTS 

A large fraction of enhancers are occupied by multiple 
TFs 

Using CSI-ANN (18) and cell-type-specific histone 
modification ChlP-Seq data, we predicted 16 128, 
23 731, 23 586 and 3 1327 enhancers in GM12878, 
K562, HepG2 and HI cells, respectively. This set of pre- 
dictions has high quahty, as roughly 70% of them overlap 
with at least one of three other genomic marks for enhan- 
cers: distal DHS, sequence conservation and distal p300 
site (Supplementary Pigure SI). 

To assess the extent of combinatorial TP binding at 
enhancers, we overlapped TP binding peaks with the set 
of predicted enhancers. We obtained ChlP-Seq data from 
ENCODE for 62, 65, 44 and 39 TPs for GM12878, K562, 
HepG2 and HI ceUs, respectively (Supplementary Pigure 
S2 and Supplementary Table SI). Pigure lA shows the 
cumulative distributions of enhancers that contain 
various number of distinct TP binding peak(s) within 
the 1 kb enhancer window. As can be seen, 61, 69, 72 
and 39% of enhancers are occupied by at least two TPs 
for GM 12878, K562, HepG2 and HI cells, respectively, 
suggesting that combinatorial binding is prevalent. 

Method overview 

We introduce a probabihstic algorithm for identifying 
frequent combinatorial occupancy patterns by multiple 
TPs at enhancers, termed PCOP in this study 
(Pigure IB). Our method is motivated by the concept of 
PIM (21). However, unlike traditional PIM, our method 
considers the probabihties of both transactions and items 
as a way to deal with uncertainties in ChlP-Seq data. Our 
method has the following three components: (i) calcula- 
tion of probabilities associated with enhancers (transac- 
tions) and TP binding peaks (items); (ii) automatic 
determination of pattern-specific minimum support 
threshold; and (hi) calculation of frequentness probabihty 
of candidate PCOPs using an uncertain transaction 
database and dynamic programming. The input data 
consists of genome-wide location information for enhan- 
cers and multiple TPs and probability values that are 
associated with each type of sequence. The only adjustable 
parameter of our method is the frequentness probability 
threshold for frequent combinatorial patterns, a. A higher 
a value will give rise to a set of predictions with higher 
specificity but lower sensitivity. To pick a default value for 
the parameter, we ran our method with a range of a 
values. We found that a value of 0.5 gives a balanced 
sensitivity and fraction of supported predictions 
(Supplementary Pigure S5). This default value was used 
for the following analyses. Software implementing our 
method is freely available to academic use on request. 
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Figure 1. Combinatorial occupancy of enhancers by multiple transcription factors. (A) Cumulative distribution of enhancers overlapping with 
multiple TF peaks. (B) Overview of our method to identify frequent combinatorial TF occupancy patterns using probabilistic FIM. p^^, probability 
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Performance comparison with alternative algorithms 

A couple of groups have used traditional FIM to identify 
frequent TF combinatorial patterns (22,23). Both 
approaches used TF binding sites defined by DNA motif 
scan instead of ChlP-Seq peaks as items. For transactions, 
Morgan et al. used 100-bp windows across the genome 
and Sun et al. used a contiguous DNA sequence that 
contain binding sites of all TFs under consideration for 
a given pattern. No confidence measure for either TF 
peaks or enhancers was used in either previous studies. 



We compared the performance of traditional FIM and 
our method using Receiver Operating Characteristic 
(ROC) Curve. To evaluate the performance of the 
methods, we manually curated a set of gold standard TF 
interactions identified using experimental protocols. To 
make the comparison meaningful, we used the same set 
of transactions and items as the input, i.e. enhancers and 
TF binding peaks described earher in the text. As shown 
in Figure 2A, our method has significantly higher area 
under the curve values than the traditional FIM [all 
P < 2.2 X 10~'*, statistical test using the method of 
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DeLong et al. (24)]. We note that the area under the curve 
values in these plots are the lower bounds of the actual 
values because of the limited coverage of the gold 
standard set of interactions. 

Predicted FCOPs in four cell types 

Using the default frequentness probabihty threshold of 
0.5, we discovered 320, 300, 194 and 62 FCOPs in 
GM 12878, K562, HepG2 and HI cells, respectively. The 
fraction of TFs involved in FCOPs ranges from 35.5% for 
GM 12878 cell to 70.4% for HepG2 cell. Likewise, the 
fraction of enhancers that support FCOPs ranges from 
30.1% for K562 ceU to 67.7% for HepG2 ceU (Table 1). 
On average, each FCOP is supported by 1025, 1086, 1062 
and 991 enhancers in GM 12878, K562, HepG2, and HI 
cells, respectively (Supplementary Table S4). These 
numbers are much higher than the minimum numbers of 
support used during the FCOP search, suggesting that the 
discovered FCOPs are not due to random chance 
(Supplementary Figure S6). 

We further corroborated our predictions using evolu- 
tionary conservation and known experimentally derived 
TF interactions. Interestingly, for all four cell types, a 
large portion of the frequent TF patterns are supported 
by enhancers that are significantly more conserved than 
the full set of enhancers in the genome (P<0.01, 
hypergeometric test, Supplementary Figure S7). For 
instance, 20.4% of enhancers that support FCOPs in 
GM 12878 cell are more conserved than the full set of en- 
hancers in this ceU type. These fractions are 34.8%o, 35.3% 
and 50% for K562, HepG2 and HI cells, respectively. This 
higher level of conservation is likely due to the require- 
ment of conserving multiple TF binding to the enhancers. 
Besides a higher conservation level, all four sets of dis- 
covered patterns are significantly enriched for known TF 
protein-protein interactions {P < 0.05, one-sided binomial 
test) (Table 1), providing additional supporting evidence 
for the discovered patterns. 

Cell-type-speciflc TF interactions are common 

Cell-type-specific interactions among TFs play a critical 
role in differential gene expression (25). But systematic 
investigation of this important issue is limited. Taking 
advantage of our set of predicted FCOPs, we asked to 
what extent they are cell-type-specific. For simplicity, we 
focused on pairwise interactions within FCOPs. We first 
expanded predicted combinatorial patterns into all 



possible TF pairs. In total we obtained 668 frequent TF 
pairs in the four cell types. To avoid complication due to 
missing data, we focused on the set of 54 TFs that are 
shared by at least two cell types in the ChlP-Seq data. 
There are 14, 20 and 20 TFs that are shared by 4, 3 and 
2 cell types, respectively (Supplementary Figure S2). 
Among these shared TFs, we predicted 282 interactions 
that occur in the FCOPs. Only 44 pairs (15.6%)) are shared 
by two cell types and no pair is shared by >2 cell types. 
For the shared pairs, they are supported by strong 
evidence of co-occurrence frequencies in the relevant cell 
types. For instance, the co-occupancy by JunD and REST 
is observed at 11.9 and 3.8% of enhancers that support 
the respective FCOPs in K562 and HepG2 cells. In stark 
contrast, co-occupancy by the same two factors is not fre- 
quently observed at enhancers in GM 12878 and HI cells 
(Figure 3). By the same token, for the unique pairs, they 
are supported by the lack of co-occupancy in ceU types in 
which no interactions are predicted (Supplementary 
Figure S8). 

Even for TF pairs that occur in more than one cell 
types, the sets of target genes co-regulated by the TF 
pairs in different ceU types may differ for multiple 
reasons, such as relative arrangement of the two TFs, add- 
itional regulators that may be involved and different chro- 
matin context of the involved enhancers. Thus, we asked 
to what extent genes regulated by the same TF pairs are 
involved in the same biological process. We performed 
Gene Ontology enrichment analysis on the target genes 
of the enhancers supporting 44 TF pairs shared between 
2 cell types. We found 21 TF pairs (47%) whose target 
gene sets do not have any shared GO biological process 
terms (Supplementary Table S5). In other words, although 
47% of the TF pairs are shared by two cell types, genes 
controlled by them have different functions in the two 
cell types. 

In summary, our result suggests that pairwise TF inter- 
actions are fairly dynamic across cell types. Even for TF 
interactions that occur in multiple cell types, the sets of 
co-regulated genes could differ substantially in different 
cell types, leading to diverse regulatory outputs. 

The majority of combinatorial patterns have flexible 
architecture 

The spacing and arrangement of TF binding sites in an 
enhancer affects its regulatory activity, which is known as 
the cis-regulatory grammar of enhancers. Two models of 



Table 1. Summary statistics of predicted combinatorial TF patterns in 


four cell types 






Statistics 


GM12878 


HI 


HepG2 


K562 


Number of predicted patterns 

Number of supported pairwise interactions (%) 

Number of supported patterns (/"-value) 

Number of TF pairs, triplets and quadruplets 

Number of TFs in patterns (%) 

Number of enhancers in patterns (%) 


320 

109 (5.8) 
183 (0) 
38, 210, 72 
22 (35.5) 
7076 (56) 


62 

80 (10.8) 
21 (5.9E-9) 
61, 1, 0 
21 (53.9) 
13391 (64.6) 


194 

68 (7.2) 
35 (4.0E-9) 
111, 79, 4 
31 (70.4) 
13838 (67.7) 


300 

166 (8.0) 
100 (8.2E-59) 
137, 154, 9 
41 (63.1) 
9073 (30.1) 



A TF set is considered to be supported by known TF interactions if any of the pairwise interactions between two TFs in the TF set was supported by 
known TF interactions. P-values are calculated using binomial distribution. 
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enhancer architecture have been proposed based on a 
number of studies. At one extreme, the enhanceosonie 
model postulates that a strict pattern of TF binding site 
arrangement is required for proper enhancer function 
(26,27). A well-known example of this model is the inter- 
feron beta gene (IFN-p) in which there are fixed binding 
order and site spacing among the six TPs, ATF-2, c-Jun, 
IRF-3, lRF-7, p50 and RelA (5). On the other hand, the 
billboard model states that (28) binding sites in enhancers 
are flexibly disposed without a strictly defined overall 
architecture. Many developmental enhancers have such 
billboard features (29). A weU-known example of bill- 
board enhancer is the even-skipped stripe 2 enhancer for 
body patterning in fly (2). 

To better understand the relative abundance of enhan- 
cers belonging to each model and the general architectural 
features of enhancers, we conducted a couple of systematic 
investigations, taking advantage of our set of predicted 
FCOPs. 

We first asked what fraction of our combinatorial 
patterns involves TFs with preferred binding site 
spacing. To this end, we first performed de novo motif 
finding for each TF using its ChlP-Seq data (see 
Supplementary Methods). Next, using the set of motifs 
and the SPAMO tool (30), we identified TF pairs in com- 
binatorial patterns that have a preferred binding site 
distance that is statistically overrepresented in the set of 
enhancers supporting the combinatorial patterns. Of the 
179 frequent TF pairs in GM12878 cefl, 31 of them 
(17.3%) have preferred binding site distance (SPAMO 
P<0.05). These numbers are 101 of 279 (36.2%), 80 of 
190 (42.1%) and 26 of 64 (40.6%) for K562, HepG2 and 
HI cells, respectively (Figure 4A, Supplementary Table 
S6). Most TF pairs have only one preferred distance and 
95% of the binding site distances are shorter than 126 bp 
(Supplementary Figure S9). The short distance between 
TF pairs suggests physical interactions among them. For 
aU TF pairs with preferred binding distances, 23.3% are 
supported by known TF physical interactions, compared 
with 16.4% for TF pairs without preferred binding site 
distance. Taken together, our result suggests that the 
majority of TFs in combinatorial patterns do not have 
preferred binding site spacing requirement. 

Besides preferred binding site distance, we also 
investigated preferred binding order among TFs in com- 
binatorial patterns. For each pattern, of all possible 
orderings of the involved TFs, we identified preferred 
orderings as those that are overrepresented in the set 
of supporting enhancers (see Supplementary Methods 
for detafls). At a P-value cutoff of 0.01, we found that 
33.4, 28.7, 21.7 and 8.1% of combinatorial patterns 
exhibit a preferred binding order for GM12878, K562, 
HepG2 and HI cells, respectively (Figure 4B and 
Supplementary Table S4). 

Enhancers belonging to the enhanceosome model have 
both strict TF binding order and binding site spacing. We 
next examined what fraction of combinatorial patterns has 
such stricter architecture. To this end, we considered a 
pattern belonging to the enhanceosome model if it has 
a preferred TF binding order and at least one TF pair in 
the pattern has a preferred binding site distance. In total, 
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Figure 4. The majority of enhancers do not have strict architectural requirements. (A) Fraction of frequent TF pairs that have preferred binding site 
spacing in enhancers. Preferred binding site spacing is determined using the SPAMO tool. (B) Fraction of frequent combinatorial patterns in which 
member TFs have preferred binding order in the enhancers. 



we found that 18.4, 17, 15.5 and 3% of the patterns fall 
into the enhanceosome model in GM 12878, K562, HepG2 
and HI cells, respectively (Supplementary Table S4). 

In conclusion, our systematic analysis suggests that 
enhancers with strict architecture only accounts for a 
small fraction. Both the binding site distance and order 
of arrangement among the bound TFs are flexible for 
the majority of complex enhancers that involve multiple 
bound TFs. 

Many combinatorial patterns have significant overlap 
with TEs 

TEs in the human genome are significantly associated with 
TF binding sites. In several cases their expansion in the 
genome led to a substantial rewiring of the regulatory 
network (31-34). However, almost all studies so far 
focused on binding sites of individual TFs and httle is 
known about the extent of overlap between TEs and com- 
binatorial TF binding sites. To address this issue, we 
searched for TEs that significantly overlap with FCOPs. 
We examined 29 families of TEs belonging to four major 
classes, DNA transposon, long-terminal repeat (LTR) 
retrotransposon, long interspersed element and short 
interspersed element. When compared with all enhancers 
in the genome, we found a remarkable enrichment of a 
number of TE families among enhancers that support 
FCOPs (/'<0.01, hypergeometric test. Supplementary 
Table S4). As an example, we found that the combinator- 
ial pattern [NR2F2, STAT5A, TALI] significantly 
overlaps with the ERVL family of LTR retrotransposon 
(Figure 5 A and B). Figure 5C summarizes the fractions of 
combinatorial patterns that significantly overlap with the 
four major classes of TEs. For both K562 and HI cells, 
LTR retrotransposons overlap with the largest fractions 
of FCOPs. For HepG2, all four classes of TEs contribute 
roughly equal fraction of overlap with FCOPs. Besides the 
most frequently overlapping TEs, it is also worth knowing 
which TF combinations most frequently overlap with 
TEs. The top three FCOPs that have most significant 
overlap with TEs are [JUND, TALI, TEAD4], [CEBPB, 
TALI, TEAD4], [STAT5, TALI, TEAD4] for K562 cell, 
[CEBPB, FOXAl, HNF4A], [FOXAl, FOXA2, 
HNF4A], [HDAC2, HNF4A, HNF4G] for HepG2 cell 



and [BCLllA, SPl], [NANOG, SPl], [NANOG, TCF12] 
for HI cell. 

When broken down into individual TE families, for 
each cell type, the top three TE famihes that overlap 
with most FCOPs are ERVL, ERVl, L2 for K562 cell, 
Alu, ERVL, LI for HepG2 cell and ERVl, ERVL-MaLR, 
ERVL for HI cell (Figure 5C inset). This result suggests 
that ERV families overlap with a disproportionate 
fraction of combinatorial binding patterns. ERV 
(endogenous retrovirus) is a major family of LTR retro- 
transposons. They are thought to have played an import- 
ant role in the evolution of mammahan genomes. Among 
the four cell types, HI cell stands out in that all TEs that 
overlap with combinatorial patterns belong to ERV 
families, suggesting the prominent role of ERVs in 
shaping the transcriptional regulation network of 
embryonic stem cells. 

DISCCUSSION 

Our method integrates both histone modification and TF 
ChlP-Seq data in a probabiUstic framework. The histone 
modification data are used to generate constraints in the 
form of enhancers. Furthermore, the histone modification 
signature used here has been shown to be associated with 
active enhancers (16,35). Thus, by focusing on combina- 
torial TF peaks in active enhancer regions, we reduce the 
chance of finding spurious clusters of TF peaks in the 
genome. To evaluate the added benefit of the histone 
modification data, we ran our algorithm by replacing the 
enhancers with moving windows of the same size across 
the genome. We found that the performance decreased 
significantly (Supplementary Figure SIO). 

An alternative to our FIM-based approach is clustering 
analysis. It has been used to conduct joint analysis of 
ChlP-Seq peaks of multiple TFs (8,9,11). The advantage 
of clustering analysis is its relative simplicity. However, 
they do not directly address the noise in the underlying 
ChlP-Seq data. Second, for most clustering algorithms, 
determining the number of clusters is a difficult problem. 
Finally, the statistical significance of resulting clusters 
is typically not assessed in clustering analysis. In compari- 
son, our method addresses all three issues. It treats 
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Figure 5. Overlap between combinatorial patterns and repetitive elements. (A) Three examples of overlap between the TFs NR2F2, STAT5A, TALI 
and the ERVL transposable elements. For TFs, ChlP-Seq peaks are shown. (B) Composite view of 1233 K562 enhancers supporting the combina- 
torial pattern [NR2F2, STAT5A, TALI] and their overlap with 111 ERVL family of transposable elements. For each TF, the average fold enrich- 
ment values at each position in the 1 kb enhancer window are plotted. For TE, the number of overlapped ERVL sequences at each position in the 
1 kb enhancer window is plotted. (C) Fraction of combinatorial patterns that overlap with four major classes of transposable elements. DNA, DNA 
transposon; LTR, long-terminal repeat retrotransposon; LINE, long interspersed element; SINE, short interspersed element. Inset, breakdown of 
overlap into TE families. 



ChlP-Seq data probabilistically. It is deterministic in 
terms of number of discovered patterns. Finally, it deter- 
mines the statistical significance of candidate frequent 
patterns. 

By analyzing a compendium of 108 TF ChlP-Seq data 
sets in four cell types, we found that the majority of en- 
hancers have flexible architecture in terms of the arrange- 
ment and spacing of constituent TF binding sites. Among 
the four cell types, we found that embryonic stem cell 
(HI), rather than the terminally differentiated cells, has 



the largest fraction of enhancers with flexible architecture. 
This is reminiscent of the observation in fly that many 
developmental enhancers tend to have a more flexible 
architecture than enhanceosome (29). In this sense, 
the architectural difference may reflect different func- 
tional requirements of the two classes of enhancers. 
Enhanceosomes may represent a type of regulatory 
switches that are mainly responsible for generating gene 
expression patterns that are relatively simple, whereas 
enhancers with flexible architecture may be responsible 
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for generating complex patterns of expression such as 
those during development. 

We demonstrate that LTR/ERV retrotransposons 
overlap with a disproportionate fraction of combinatorial 
TF binding regions, especially in HI embryonic stem cells. 
This is interesting given that ERV elements have been 
observed to contribute to the rewiring of transcriptional 
regulatory networks in ESCs and placenta (36-38). The 
unusual high percentage of ERV-derived combinatorial 
patterns in HI cells is likely a consequence of the permis- 
sive chromatin state found in ESCs (39,40). It has been 
suggested that the manipulations that were initially 
exerted by the ancestral viruses on their hosts to bypass 
these antiviral control mechanisms have also facilitated 
their co-option into enhancer elements (41). Along this 
hne, it has been shown that stem ceh potency fluctuates 
with endogenous retrovirus activity in mouse (42). 

In summary, we introduce a powerful computational 
method for uncovering combinatorial interactions 
among TFs. As the amount of genome-wide localization 
data continues to accumulate for various regulatory 
proteins, our method will prove increasingly useful for 
dissecting combinatorial gene regulation by the action of 
TFs as well as other types of regulatory proteins. 
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